Language identification based on n-gram frequency ranking

Authors

  • Ricardo de Córdoba
  • Luis Fernando D'Haro
  • Fernando Fernández-Martínez
  • Javier Macías Guarasa
  • Javier Ferreiros
Abstract

We present a novel approach to language identification based on a text categorization technique, namely n-gram frequency ranking. We use a parallel phone recognizer, as in PPRLM, but instead of a language model we build a ranking of the most frequent n-grams, keeping only a fraction of them. We then compute the distance between the input sentence ranking and each language ranking, based on the difference in relative position of each n-gram. The aim of this ranking is to reliably model a longer span than PPRLM, namely 5-grams instead of trigrams, because the ranking needs less training data for a reliable estimate. We show that this approach outperforms PPRLM (6% relative improvement) thanks to the inclusion of 4-grams and 5-grams in the classifier. We present two alternatives: ranking by absolute number of occurrences and ranking by discriminative values (11% relative improvement).
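The ranking-and-distance idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact distance formula, so this sketch assumes the classic "out-of-place" measure from text categorization: the sum of absolute rank differences, with a fixed penalty for n-grams missing from the language ranking. The function names, the `top_k` cutoff, and the penalty value are all hypothetical, and the input is assumed to be a sequence of phone labels such as a phone recognizer might output.

```python
from collections import Counter


def ngrams(tokens, n_max=5):
    """Yield every n-gram of order 1..n_max from a token sequence, as tuples."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def build_ranking(tokens, n_max=5, top_k=300):
    """Map each of the top_k most frequent n-grams to its rank (0 = most frequent)."""
    counts = Counter(ngrams(tokens, n_max))
    # Sort by descending count; break ties on the n-gram itself for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ordered[:top_k])}


def ranking_distance(input_rank, lang_rank, penalty):
    """Sum of rank differences; n-grams absent from lang_rank cost `penalty` (assumed)."""
    total = 0
    for gram, pos in input_rank.items():
        if gram in lang_rank:
            total += abs(pos - lang_rank[gram])
        else:
            total += penalty
    return total


def identify(tokens, lang_rankings, n_max=5, top_k=300):
    """Return the language whose ranking is closest to the input's ranking."""
    input_rank = build_ranking(tokens, n_max, top_k)
    return min(
        lang_rankings,
        key=lambda lang: ranking_distance(input_rank, lang_rankings[lang], top_k),
    )


# Toy usage with made-up phone sequences for two hypothetical languages.
rankings = {
    "A": build_ranking("a b a b a b a b".split()),
    "B": build_ranking("c d c d c d".split()),
}
print(identify("a b a b".split(), rankings))
```

Keeping only the `top_k` entries mirrors the abstract's "keeping only a fraction of them". The discriminative variant mentioned in the abstract is not sketched here because the abstract does not define it.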

Similar resources

Language Identification of Short Text Segments with N-gram Models

There are many accurate methods for language identification of long text samples, but identification of very short strings still presents a challenge. This paper studies a language identification task, in which the test samples have only 5–21 characters. We compare two distinct methods that are well suited for this task: a naive Bayes classifier based on character n-gram models, and the ranking...

A Comparative Study on Language Identification Methods

In this paper we present two experiments conducted for comparison of different language identification algorithms. Short-word-, frequent-word- and n-gram-based approaches are considered and combined with the Ad-Hoc Ranking classification method. The language identification process can be subdivided into two main steps: First a document model is generated for the document and a language model fo...

Experiments with N-Gram Prefixes on a Multinomial Language Model versus Lucene's Off-the-shelf Ranking Scheme and Rocchio Query Expansion (TEL@CLEF Monolingual Task)

We describe our participation in the TEL@CLEF task of the CLEF 2009 ad-hoc track, where we measured the retrieval performance of LGTE, an index engine for Geo-Temporal collections which is mostly based on Lucene, together with extensions for query expansion and multinomial language modelling. We experiment with an N-gram stemming model to improve on our last year's experiments, which consisted in combinati...

Language Identification from Text Using N-gram Based Cumulative Frequency Addition

This paper describes the preliminary results of an efficient language classifier using an ad-hoc Cumulative Frequency Addition of N-grams. The new classification technique is simpler than the conventional Naïve Bayesian classification method, but it performs similarly in speed overall and better in accuracy on short input strings. The classifier is also 5-10 times faster than N-gram based rank-...

Detection of Foreign Entities in Native Text Using N-gram Based Cumulative Frequency Addition

This paper describes a logarithmic version of the conventional Naïve Bayesian N-gram-based text-classification algorithm that we name Cumulative Frequency Addition (CFA) and its application in three tasks: language identification, nationality identification from names, and detection of foreign words in base text. The new CFA technique is 3-10 times faster than N-gram based rank-order statistica...

Publication date: 2007